透明对象对视觉感知系统提出了多个不同的挑战。首先,他们缺乏区分视觉特征使透明对象比不透明的对象更难检测和本地化。即使人类也发现某些透明的表面几乎没有镜面反射或折射,例如玻璃门,难以感知。第二个挑战是,通常用于不透明对象感知的常见深度传感器由于其独特的反射特性而无法对透明对象进行准确的深度测量。由于这些挑战,我们观察到,同一类别(例如杯子)内的透明对象实例看起来与彼此相似,而不是同一类别的普通不透明对象。鉴于此观察结果,本文着手探讨类别级透明对象姿势估计的可能性,而不是实例级姿势估计。我们提出了TransNet,这是一种两阶段的管道,该管道学会使用局部深度完成和表面正常估计来估计类别级别的透明对象姿势。在最近的大规模透明对象数据集中,根据姿势估计精度评估了TransNet,并将其与最先进的类别级别姿势估计方法进行了比较。该比较的结果表明,TransNet可以提高透明对象的姿势估计准确性,并从随附的消融研究中提高了关键发现,这表明未来的方向改善了绩效。
translated by 谷歌翻译
在分支机构和结合中得出良好的可变选择策略对于现代混合编程(MIP)求解器的效率至关重要。通过在先前的解决方案过程中收集的MIP分支数据,学习分支方法最近变得比启发式方法更好。由于分支机构自然是一项顺序决策任务,因此应该学会优化整个MIP求解过程的实用性,而不是在每个步骤上都是近视。在这项工作中,我们将学习作为离线增强学习(RL)问题进行分支,并提出了一种长期视线的混合搜索方案来构建离线MIP数据集,该数据集对分支决策的长期实用程序。在政策培训阶段,我们部署了基于排名的奖励分配计划,以将有希望的样本与长期或短期视图区分开,并通过离线政策学习训练名为分支排名的分支模型。合成MIP基准和现实世界任务的实验表明,与广泛使用的启发式方法和基于先进的学习分支模型相比,分支rankink更有效,更健壮,并且可以更好地概括为MIP实例的大型MIP实例。
translated by 谷歌翻译
透明的物体在家庭环境中无处不在,并且对视觉传感和感知系统构成了不同的挑战。透明物体的光学特性使常规的3D传感器仅对物体深度和姿势估计不可靠。这些挑战是由重点关注现实世界中透明对象的大规模RGB深度数据集突出了这些挑战。在这项工作中,我们为名为ClearPose的大规模现实世界RGB深度透明对象数据集提供了一个用于分割,场景级深度完成和以对象为中心的姿势估计任务的基准数据集。 ClearPose数据集包含超过350K标记的现实世界RGB深度框架和5M实例注释,涵盖了63个家用对象。该数据集包括在各种照明和遮挡条件下在日常生活中常用的对象类别,以及具有挑战性的测试场景,例如不透明或半透明物体的遮挡病例,非平面取向,液体的存在等。 - 艺术深度完成和对象构成清晰度上的深神经网络。数据集和基准源代码可在https://github.com/opipari/clearpose上获得。
translated by 谷歌翻译
视觉感知任务通常需要大量的标记数据,包括3D姿势和图像空间分割掩码。创建此类培训数据集的过程可能很难或耗时,可以扩展到一般使用的功效。考虑对刚性对象的姿势估计的任务。在大型公共数据集中接受培训时,基于神经网络的深层方法表现出良好的性能。但是,将这些网络调整为其他新颖对象,或针对不同环境的现有模型进行微调,需要大量的时间投资才能产生新标记的实例。为此,我们提出了ProgressLabeller作为一种方法,以更有效地以可扩展的方式从彩色图像序列中生成大量的6D姿势训练数据。 ProgressLabeller还旨在支持透明或半透明的对象,以深度密集重建的先前方法将失败。我们通过快速创建一个超过1M样品的数据集来证明ProgressLabeller的有效性,我们将其微调一个最先进的姿势估计网络,以显着提高下游机器人的抓地力。 ProgressLabeller是https://github.com/huijiezh/progresslabeller的开放源代码。
translated by 谷歌翻译
用于对象检测的常规知识蒸馏(KD)方法主要集中于同质的教师学生探测器。但是,用于部署的轻质检测器的设计通常与高容量探测器显着不同。因此,我们研究了异构教师对之间的KD,以进行广泛的应用。我们观察到,异质KD(异核KD)的核心难度是由于不同优化的方式而导致异质探测器的主链特征之间的显着语义差距。常规的同质KD(HOMO-KD)方法遭受了这种差距的影响,并且很难直接获得异性KD的令人满意的性能。在本文中,我们提出了异助剂蒸馏(Head)框架,利用异质检测头作为助手来指导学生探测器的优化以减少此间隙。在头上,助手是一个额外的探测头,其建筑与学生骨干的老师负责人同质。因此,将异源KD转变为同性恋,从而可以从老师到学生的有效知识转移。此外,当训练有素的教师探测器不可用时,我们将头部扩展到一个无教师的头(TF-Head)框架。与当前检测KD方法相比,我们的方法已取得了显着改善。例如,在MS-COCO数据集上,TF-Head帮助R18视网膜实现33.9 MAP(+2.2),而Head将极限进一步推到36.2 MAP(+4.5)。
translated by 谷歌翻译
几乎所有的多代理强化学习算法没有交流,都遵循分散执行的集中培训原则。在集中培训期间,代理可以以相同的信号为指导,例如全球国家。但是,在分散执行期间,代理缺乏共享信号。受到观点不变性和对比学习的启发,我们在本文中提出了共识学习,以学习合作的多代理增强学习。尽管基于局部观察结果,但不同的代理可以在离散空间中推断出相同的共识。在分散执行期间,我们将推断的共识作为对代理网络的明确输入提供了,从而发展了他们的合作精神。我们提出的方法可以扩展到具有小模型更改的各种多代理增强学习算法。此外,我们执行一些完全合作的任务,并获得令人信服的结果。
translated by 谷歌翻译
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译
Increasing research interests focus on sequential recommender systems, aiming to model dynamic sequence representation precisely. However, the most commonly used loss function in state-of-the-art sequential recommendation models has essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers the vanishing gradient problem from numerous negative sampling and predictionbiases; Binary Cross-Entropy (BCE) loss subjects to negative sampling numbers, thereby it is likely to ignore valuable negative examples and reduce the training efficiency; Cross-Entropy (CE) loss only focuses on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representation. To avoid these limitations, in this paper, we propose to calculate Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, which enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models GRU4Rec, SASRec, and S3-Rec can reach 125.63%, 69.90%, and 33.24% average improvement of full ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data increases rapidly with the wall clock time, and is superior to that of other loss functions in almost the whole process of model training.
translated by 谷歌翻译